Lesson 6

Welcome

Notes: See blog post for this rmd explanation: https://solomonmessing.wordpress.com/2014/01/19/visualization-series-the-scatterplot-or-how-to-use-data-so-you-dont-get-ripped-off/ ***

library('ggplot2')
data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Scatterplot Review

qplot(x = carat, y = price, data = diamonds,
    xlim = c(0, quantile(diamonds$carat,0.99)),
    ylim = c(0, quantile(diamonds$price, 0.99)))+
  geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).

## Warning: Removed 926 rows containing missing values (geom_point).

ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point(fill = I('#F79420'), color = I('black'), shape = 21) +
  #linear trim
  stat_smooth(method = 'lm') +
  #using 99 percentile
  xlim(0, quantile(diamonds$carat,0.99)) +
  ylim(0, quantile(diamonds$price, 0.99))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).


Price and Carat Relationship

Response: Non linear relationship, maybe exponential, maybe something else. Price increases with numbers of carats. If we paint a linear trend line, we can see that it doesn´t go through the center of the data in some key places. ***

Frances Gerety

Notes:

A diamonds is


The Rise of Diamonds

Notes:


ggpairs Function

Notes: Ggpairs provide: -lower triangle grouped histograms for qualitative-qualitative pairs (using the y as grouping factor) and scatter plots for quantitative-quantitative pairs -upper triangle grouped histograms for qualitative-qualitative pairs (using the x as grouping factor), boxplots for qualitative-quantitative pairs and correlation for quantitative-quantitative pairs.

# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.1
library(scales)
## Warning: package 'scales' was built under R version 3.4.1
library(memisc)
## Warning: package 'memisc' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.3
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set otherwise it will take too long
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
#ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
ggpairs(diamond_samp,
  # to make the points smaller and clearer
  lower = list(continuous = wrap("points", shape = I('.'))),
  upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  #axisLabels = 'internal')

What are some things you notice in the ggpairs output? Response:


The Demand of Diamonds

Notes:

plot1 <- qplot(data = diamonds, x = price, binwidth = 100, fill = I('#099DD9')) +
  ggtitle('price')
plot2 <- qplot(data = diamonds, x = price, binwidth = 0.01, fill = I('#F79420')) +
  ggtitle('price (log10)') + 
  scale_x_log10()
plot3 <- qplot(data = diamonds, x = log10(price), binwidth = 0.01, fill = I('#F79420')) +
  ggtitle('log10(price)') 
  

library(gridExtra)
library(grid)
grid.arrange(plot1, plot2, plot3, ncol = 2)


Connecting Demand and Price Distributions

Notes:


Scatterplot Transformation

On the log scale prices look less dispersed at the high end of carat size and price.

qplot(carat, price, data = diamonds)+
  scale_y_continuous(trans = log10_trans())+
  ggtitle('Price (log10) by Carat')

Create a new function to transform the carat variable

We can do better (we are trying to get a linear representation)

# this is an R function. We need the inverse conversion too.
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).


Overplotting Revisited

Due mainly to rounding we have multiple points overlapped which obscures some of the density and sparsity of our data, the key points.

head(sort(table(diamonds$carat), decreasing = T))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121

You can make your points smaller, jittering them and adding some transparency.

# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.

# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = 0.5, size = 0.75, position = 'jitter') + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Other Qualitative Factors

Notes:


Price vs. Carat and Clarity

Alter the code below.

# Adjust the code below to color the points by clarity.

# A layer called scale_color_brewer() has 
# been added to adjust the legend and
# provide custom colors.

# See if you can figure out what it does.
# Links to resources are in the Instructor Notes.

# You will need to install the package RColorBrewer
# in R to get the same colors and color palettes.

# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price

Response: For a fixed number of carats we can see that diamonds with lower clarity al almost always cheaper than diamonds with higher clarity.


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price

Response: Most of the data is from diamonds with ideal cut, but we do see that most of the fair cuts are in the lower prices for a given carat.


Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color',
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price

Response:Color does seem to explain some variance in the price.


Linear Models in R

Notes: In R we can create models using the lm() function. x lm(x~y) with x the outcome variable and y the explanatory variable. Response: In our case x = log(price) y = carat ^ 1/3 (cube root) ***

Building the Linear Model

Notes: The I wrapper around the variables stands for as-is. In this case it tells R to use the expression inside the I function to tranform a variable before using it in the regression. This is instead of instructing R to interpret these symbols as part of the formula to construct the design matrix for the regression.

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
# updating the model
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
# mtable produces a table of estimates for several models.
mtable(m1, m2, m3, m4, m5, sdigits = 3)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ============================================================================================
##                        m1             m2             m3             m4            m5        
## --------------------------------------------------------------------------------------------
##   (Intercept)          2.821***       1.039***       0.874***      0.932***       0.415***  
##                       (0.006)        (0.019)        (0.019)       (0.017)        (0.010)    
##   I(carat^(1/3))       5.558***       8.568***       8.703***      8.438***       9.144***  
##                       (0.007)        (0.032)        (0.031)       (0.028)        (0.016)    
##   carat                              -1.137***      -1.163***     -0.992***      -1.093***  
##                                      (0.012)        (0.011)       (0.010)        (0.006)    
##   cut: .L                                            0.224***      0.224***       0.120***  
##                                                     (0.004)       (0.004)        (0.002)    
##   cut: .Q                                           -0.062***     -0.062***      -0.031***  
##                                                     (0.004)       (0.003)        (0.002)    
##   cut: .C                                            0.051***      0.052***       0.014***  
##                                                     (0.003)       (0.003)        (0.002)    
##   cut: ^4                                            0.018***      0.018***      -0.002     
##                                                     (0.003)       (0.002)        (0.001)    
##   color: .L                                                       -0.373***      -0.441***  
##                                                                   (0.003)        (0.002)    
##   color: .Q                                                       -0.129***      -0.093***  
##                                                                   (0.003)        (0.002)    
##   color: .C                                                        0.001         -0.013***  
##                                                                   (0.003)        (0.002)    
##   color: ^4                                                        0.029***       0.012***  
##                                                                   (0.003)        (0.002)    
##   color: ^5                                                       -0.016***      -0.003*    
##                                                                   (0.003)        (0.001)    
##   color: ^6                                                       -0.023***       0.001     
##                                                                   (0.002)        (0.001)    
##   clarity: .L                                                                     0.907***  
##                                                                                  (0.003)    
##   clarity: .Q                                                                    -0.240***  
##                                                                                  (0.003)    
##   clarity: .C                                                                     0.131***  
##                                                                                  (0.003)    
##   clarity: ^4                                                                    -0.063***  
##                                                                                  (0.002)    
##   clarity: ^5                                                                     0.026***  
##                                                                                  (0.002)    
##   clarity: ^6                                                                    -0.002     
##                                                                                  (0.002)    
##   clarity: ^7                                                                     0.032***  
##                                                                                  (0.001)    
## --------------------------------------------------------------------------------------------
##   R-squared            0.924          0.935          0.939         0.951          0.984     
##   adj. R-squared       0.924          0.935          0.939         0.951          0.984     
##   sigma                0.280          0.259          0.250         0.224          0.129     
##   F               652012.063     387489.366     138654.523     87959.467     173791.084     
##   p                    0.000          0.000          0.000         0.000          0.000     
##   Log-likelihood   -7962.499      -3631.319      -1837.416      4235.240      34091.272     
##   Deviance          4242.831       3613.360       3380.837      2699.212        892.214     
##   AIC              15930.999       7270.637       3690.832     -8442.481     -68140.544     
##   BIC              15957.685       7306.220       3761.997     -8317.942     -67953.736     
##   N                53940          53940          53940         53940          53940         
## ============================================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Video Notes: OUR MODEL IS (see column 5, the last model in the previous result)

log(price) = 0.415 + 9.144 * carat^(1/3) - 1.093 * carat….

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response:


A Bigger, Better Data Set

Notes:

#install.packages('bitops')
#install.packages('RCurl')
library('RCurl')
## Warning: package 'RCurl' was built under R version 3.4.3
## Loading required package: bitops
##library('bitops')

#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.

# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).

# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!

# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.

# This exercise WILL BE automatically graded.

# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)

# Your task is to write the code to create the models.
#
# DO NOT ALTER THE CODE BELOW THIS LINE (Reads in a sample of the diamondsbig data set)
#===========================================================================================
# diamondsBigSample <- read.csv('diamondsBigSample.csv')


# ENTER YOUR CODE BELOW THIS LINE. (Create the five models)
#===========================================================================================
diamondsbig$logprice = log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)),
         data=diamondsbig[diamondsbig$price < 10000 &
                            diamondsbig$cert == "GIA",])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##         "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##     "GIA", ])
## 
## ================================================================================================
##                         m1              m2             m3             m4              m5        
## ------------------------------------------------------------------------------------------------
##   (Intercept)           2.671***        1.333***       0.949***       0.529***       -0.464***  
##                        (0.003)         (0.012)        (0.012)        (0.010)         (0.009)    
##   I(carat^(1/3))        5.839***        8.243***       8.633***       8.110***        8.320***  
##                        (0.004)         (0.022)        (0.021)        (0.017)         (0.012)    
##   carat                                -1.061***      -1.223***      -0.782***       -0.763***  
##                                        (0.009)        (0.009)        (0.007)         (0.005)    
##   cut: V.Good                                          0.120***       0.090***        0.071***  
##                                                       (0.002)        (0.001)         (0.001)    
##   cut: Ideal                                           0.211***       0.181***        0.131***  
##                                                       (0.002)        (0.001)         (0.001)    
##   color: K/L                                                          0.123***        0.117***  
##                                                                      (0.004)         (0.003)    
##   color: J/L                                                          0.312***        0.318***  
##                                                                      (0.003)         (0.002)    
##   color: I/L                                                          0.451***        0.469***  
##                                                                      (0.003)         (0.002)    
##   color: H/L                                                          0.569***        0.602***  
##                                                                      (0.003)         (0.002)    
##   color: G/L                                                          0.633***        0.665***  
##                                                                      (0.003)         (0.002)    
##   color: F/L                                                          0.687***        0.723***  
##                                                                      (0.003)         (0.002)    
##   color: E/L                                                          0.729***        0.756***  
##                                                                      (0.003)         (0.002)    
##   color: D/L                                                          0.812***        0.827***  
##                                                                      (0.003)         (0.002)    
##   clarity: I1                                                                         0.301***  
##                                                                                      (0.006)    
##   clarity: SI2                                                                        0.607***  
##                                                                                      (0.006)    
##   clarity: SI1                                                                        0.727***  
##                                                                                      (0.006)    
##   clarity: VS2                                                                        0.836***  
##                                                                                      (0.006)    
##   clarity: VS1                                                                        0.891***  
##                                                                                      (0.006)    
##   clarity: VVS2                                                                       0.935***  
##                                                                                      (0.006)    
##   clarity: VVS1                                                                       0.995***  
##                                                                                      (0.006)    
##   clarity: IF                                                                         1.052***  
##                                                                                      (0.006)    
## ------------------------------------------------------------------------------------------------
##   R-squared             0.888           0.892          0.899          0.937           0.969     
##   adj. R-squared        0.888           0.892          0.899          0.937           0.969     
##   sigma                 0.289           0.284          0.275          0.216           0.154     
##   F               2700903.714     1406538.330     754405.425     423311.488      521161.443     
##   p                     0.000           0.000          0.000          0.000           0.000     
##   Log-likelihood   -60137.791      -53996.269     -43339.818      37830.414      154124.270     
##   Deviance          28298.689       27291.534      25628.285      15874.910        7992.720     
##   AIC              120281.582      108000.539      86691.636     -75632.827     -308204.540     
##   BIC              120313.783      108043.473      86756.037     -75482.557     -307968.400     
##   N                338946          338946         338946         338946          338946         
## ================================================================================================
# DO NOT ALTER THE CODE BELOW THIS LINE (Tables your models and pulls out the statistics)
#===========================================================================================
suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))
models <- mtable(m1, m2, m3, m4, m5)

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
# We need to exponentiate the result of the model wince we took the log of price
exp(modelEstimate)
##        fit     lwr      upr
## 1 5040.436 3730.34 6810.638

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.


Final Thoughts

Notes:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!